2.2 DataFrame

Content:

  • 2.2.1 DataFrame Structure
  • 2.2.2 Working with Columns
  • 2.2.3 Working with Rows
  • 2.2.4 Conditional Selection
  • 2.2.5 Case Study: Olympic Games

A DataFrame is a two dimensional data structure with columns of potentially different data types. It can be considered like a table or a spreadsheet, similar to R's data.frame.

A DataFrame has both row index and column index.


In [1]:
import numpy as np
import pandas as pd

2.1.1 DataFrame Structure

Initializing a Dataframe.


In [2]:
# Create a DataFrame from dictionary
data = {'Region':['Central','East','North','North-East','West'],
        'Area':[132.7,93.1,134.5,103.9,201.3],
       'Population':[939890,693500,531860,834450,903010]}
df = pd.DataFrame(data)
# Display df
df


Out[2]:
Area Population Region
0 132.7 939890 Central
1 93.1 693500 East
2 134.5 531860 North
3 103.9 834450 North-East
4 201.3 903010 West

In [3]:
# Note that the row index are assigned automatically and column index are arranged in alphabetical order.
# Rearrange columns
df = pd.DataFrame(data,columns=['Region','Population','Area'])
df


Out[3]:
Region Population Area
0 Central 939890 132.7
1 East 693500 93.1
2 North 531860 134.5
3 North-East 834450 103.9
4 West 903010 201.3

In [4]:
# Display columns names
df.columns


Out[4]:
Index([u'Region', u'Population', u'Area'], dtype='object')

In [5]:
# Display all values
df.values


Out[5]:
array([['Central', 939890L, 132.7],
       ['East', 693500L, 93.1],
       ['North', 531860L, 134.5],
       ['North-East', 834450L, 103.9],
       ['West', 903010L, 201.3]], dtype=object)

In [6]:
# Get the number of rows and columns of the dataframe i.e. its shape
df.shape


Out[6]:
(5, 3)

In [7]:
# Size of DataFrame = row x column
df.size


Out[7]:
15

In [8]:
# Number of rows
len(df)


Out[8]:
5

In [11]:
# Programming specific information of the dataframe
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
Region        5 non-null object
Population    5 non-null int64
Area          5 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 192.0+ bytes

In [9]:
# Statistical description of numerical columns
df.describe()


Out[9]:
Population Area
count 5.00000 5.000000
mean 780542.00000 133.100000
std 167865.50828 42.148547
min 531860.00000 93.100000
25% 693500.00000 103.900000
50% 834450.00000 132.700000
75% 903010.00000 134.500000
max 939890.00000 201.300000

2.2.2 Working with columns


In [12]:
# Rename a column label
df = df.rename(columns={'Population':'Pop'})
df


Out[12]:
Region Pop Area
0 Central 939890 132.7
1 East 693500 93.1
2 North 531860 134.5
3 North-East 834450 103.9
4 West 903010 201.3

In [13]:
# Select a single column to series
A = df['Area'] # same answer as df.Area
A


Out[13]:
0    132.7
1     93.1
2    134.5
3    103.9
4    201.3
Name: Area, dtype: float64

In [14]:
# Select a single column to dataframe
B = df[['Area']]
B


Out[14]:
Area
0 132.7
1 93.1
2 134.5
3 103.9
4 201.3

In [15]:
# Select multiple columns to dataframe
C = df[['Area','Pop']]
C


Out[15]:
Area Pop
0 132.7 939890
1 93.1 693500
2 134.5 531860
3 103.9 834450
4 201.3 903010

In [16]:
# Change order of columns
D = df[['Region','Area','Pop']]
D


Out[16]:
Region Area Pop
0 Central 132.7 939890
1 East 93.1 693500
2 North 134.5 531860
3 North-East 103.9 834450
4 West 201.3 903010

In [17]:
# Drop a column by label
E = df.drop('Area',axis=1)
E


Out[17]:
Region Pop
0 Central 939890
1 East 693500
2 North 531860
3 North-East 834450
4 West 903010

In [21]:
# Create a new column 'Density' = 'Population'/'Area'
df['Density'] = df['Pop']/df['Area']
df


Out[21]:
Region Pop Area Density
0 Central 939890 132.7 7082.818387
1 East 693500 93.1 7448.979592
2 North 531860 134.5 3954.349442
3 North-East 834450 103.9 8031.280077
4 West 903010 201.3 4485.891704

In [22]:
# Sort values
df.sort_values(by=['Pop'], ascending=False)


Out[22]:
Region Pop Area Density
0 Central 939890 132.7 7082.818387
4 West 903010 201.3 4485.891704
3 North-East 834450 103.9 8031.280077
1 East 693500 93.1 7448.979592
2 North 531860 134.5 3954.349442

In [23]:
# Find index label for max/min values
df['Density'].idxmax()


Out[23]:
3

Exercise: Which region has the highest density? Can you get the answer without sorting?


In [23]:
# method 1: 
df['Region'][df['Density'].idxmax()]


Out[23]:
'North-East'

In [24]:
# method 2: Change the index
df1 = df.set_index('Region')
df1['Density'].idxmax()


Out[24]:
'North-East'

There are many commonly used column-wide methods/attributes:

  • df['col'].size
  • df['col'].count()
  • df['col'].sum()
  • df['col'].max()
  • df['col'].mean()
  • df['col'].std()

In [25]:
# Get all numerical summaries of a column
df['Pop'].describe()


Out[25]:
count         5.00000
mean     780542.00000
std      167865.50828
min      531860.00000
25%      693500.00000
50%      834450.00000
75%      903010.00000
max      939890.00000
Name: Pop, dtype: float64

2.2.3 Working with rows


In [26]:
# Select multiple rows
df[2:4]


Out[26]:
Region Pop Area Density
2 North 531860 134.5 3954.35
3 North-East 834450 103.9 8031.28

In [27]:
# Select the last row
df[-1:]


Out[27]:
Region Pop Area Density
4 West 903010 201.3 4485.89

In [28]:
# Select all but last row
df[:-1]


Out[28]:
Region Pop Area Density
0 Central 939890 132.7 7082.82
1 East 693500 93.1 7448.98
2 North 531860 134.5 3954.35
3 North-East 834450 103.9 8031.28

In [29]:
# Select all even rows
df[::2]


Out[29]:
Region Pop Area Density
0 Central 939890 132.7 7082.82
2 North 531860 134.5 3954.35
4 West 903010 201.3 4485.89

In [30]:
# Select by .iloc
df.iloc[0:2,1:3]


Out[30]:
Pop Area
0 939890 132.7
1 693500 93.1

In [31]:
# Select by .loc
df.loc[0:1,['Pop','Area']]


Out[31]:
Pop Area
0 939890 132.7
1 693500 93.1

2.2.4 Conditional Selection


In [24]:
# Boolean masking
df['Pop']>800000


Out[24]:
0     True
1    False
2    False
3     True
4     True
Name: Pop, dtype: bool

In [25]:
# Select rows by boolean masking
df[df['Pop']>800000]


Out[25]:
Region Pop Area Density
0 Central 939890 132.7 7082.818387
3 North-East 834450 103.9 8031.280077
4 West 903010 201.3 4485.891704

In [26]:
# Boolean masking with ==
df['Region']=='Central'


Out[26]:
0     True
1    False
2    False
3    False
4    False
Name: Region, dtype: bool

In [27]:
# Select rows by boolean masking
df[df['Region']=='Central']


Out[27]:
Region Pop Area Density
0 Central 939890 132.7 7082.818387

In [28]:
# Using .loc to find the Area of the Central region.
df.loc[df['Region']=='Central', 'Area']


Out[28]:
0    132.7
Name: Area, dtype: float64

In [29]:
# Multiple conditions (and: &) (or: |)
(df['Pop'] < 800000) & (df['Density']<8000)


Out[29]:
0    False
1     True
2     True
3    False
4    False
dtype: bool

In [30]:
# Select rows by multiple conditions
df[(df['Pop'] < 800000) & (df['Density']<8000)]


Out[30]:
Region Pop Area Density
1 East 693500 93.1 7448.979592
2 North 531860 134.5 3954.349442

In [31]:
# Using .query method

df.query("Pop < 800000 & Density < 8000")


Out[31]:
Region Pop Area Density
1 East 693500 93.1 7448.979592
2 North 531860 134.5 3954.349442

2.2.5 Case Study: Olympic Games

We can import data from a csv file by using pd.read_csv. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

We can import data from an Excel file by using pd.read_excel. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

Source of data: https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table


In [32]:
# Import data from Excel file
og = pd.read_excel('OlympicGames.xlsx')
# Display first 5 rows of dataframe
og.head()


Out[32]:
Country Games Gold Silver Bronze
0 Afghanistan 14 0 0 2
1 Algeria 13 5 4 8
2 Argentina 24 21 25 28
3 Armenia 6 2 5 7
4 Australasia 2 3 4 5

Q1: How many rows and columns are there in the dataframe?


In [33]:
og.shape


Out[33]:
(151, 5)

Q2: Are there any missing values in the dataframe?


In [34]:
#og.isnull().sum()
og.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 151 entries, 0 to 150
Data columns (total 5 columns):
Country    151 non-null object
Games      151 non-null int64
Gold       151 non-null int64
Silver     151 non-null int64
Bronze     151 non-null int64
dtypes: int64(4), object(1)
memory usage: 7.1+ KB

Q3: Create a new column for total number of Olympic medals.


In [35]:
og['Total'] = og['Gold']+og['Silver']+og['Bronze']
og.head()


Out[35]:
Country Games Gold Silver Bronze Total
0 Afghanistan 14 0 0 2 2
1 Algeria 13 5 4 8 17
2 Argentina 24 21 25 28 74
3 Armenia 6 2 5 7 14
4 Australasia 2 3 4 5 12

Q4: Select the row where Country = Singapore.


In [36]:
og[og['Country']=='Singapore']


Out[36]:
Country Games Gold Silver Bronze Total
116 Singapore 16 1 2 2 5

Q5: Which country has won the highest number of Gold medals?


In [38]:
#og[og['Gold'] == og['Gold'].max()]['Country']
og['Country'][og['Gold'].idxmax()]


Out[38]:
u'United States'

Q6: How many countries participated in at least 25 Olympic games (100 years)? Return an integer.


In [40]:
len(og.query("Games >= 25").loc[:, "Country"])


Out[40]:
16

Q7: Which are the top 3 countries with highest total Olympic medals?

Challenge! Can you return a list of countries in one line of code?


In [48]:
og.sort_values(by='Total', ascending=False)[0:3]['Country'].values


Out[48]:
array(['United States', 'Soviet Union', 'Great Britain'], dtype=object)

Q8: Out of the countries which have not won any Gold medals, which country has won the highest number of medals?

Challenge! Can you return the name of the country in one line of code?


In [42]:
og['Country'][og[og['Gold']==0]['Total'].idxmax()]
og.Country[og.query("Gold == 0").Total.idxmax()]


Out[42]:
u'Malaysia'

In [ ]: